NSF PAR Search | NSF Public Access Repository


AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration

Yang, Zhuoping; Zhuang, Jinming; Chen, Xingzhen; Jones, Alex; Zhou, Peipei (November 2025, ACM)

GPUs are critical for compute-intensive applications, yet emerging workloads such as recommender systems, graph analytics, and data analytics often exceed GPU memory capacity. Existing solutions allow GPUs to use CPU DRAM or SSDs as external memory, and the GPU-centric approach enables GPU threads to directly issue NVMe requests, further avoiding CPU intervention. However, current GPU-centric approaches adopt synchronous I/O, forcing threads to stall during long communication delays. We propose AGILE, a lightweight asynchronous GPU-centric I/O library that eliminates deadlock risks and integrates a flexible HBM-based software cache. AGILE overlaps computation and I/O, improving performance by up to 1.88× across workloads with diverse computation-to-communication ratios. Compared to BaM on DLRM, AGILE achieves up to 1.75× speedup through efficient design and overlapping; on graph applications, AGILE reduces software cache overhead by up to 3.12× and NVMe I/O overhead by up to 2.85×; AGILE also lowers per-thread register usage by up to 1.32×.
Free, publicly-accessible full text available November 16, 2026
AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration

https://doi.org/10.1145/3712285.3759778

Yang, Zhuoping; Zhuang, Jinming; Chen, Xingzhen; Jones, Alex; Zhou, Peipei (November 2025, ACM)

GPUs are critical for compute-intensive applications, yet emerging workloads such as recommender systems, graph analytics, and data analytics often exceed GPU memory capacity. Existing solutions allow GPUs to use CPU DRAM or SSDs as external memory, and the GPU-centric approach enables GPU threads to directly issue NVMe requests, further avoiding CPU intervention. However, current GPU-centric approaches adopt synchronous I/O, forcing threads to stall during long communication delays. We propose AGILE, a lightweight asynchronous GPU-centric I/O library that eliminates deadlock risks and integrates a flexible HBM-based software cache. AGILE overlaps computation and I/O, improving performance by up to 1.88× across workloads with diverse computation-to-communication ratios. Compared to BaM on DLRM, AGILE achieves up to 1.75× speedup through efficient design and overlapping; on graph applications, AGILE reduces software cache overhead by up to 3.12× and NVMe I/O overhead by up to 2.85×; AGILE also lowers per-thread register usage by up to 1.32×.
Free, publicly-accessible full text available November 15, 2026
DERCA: DetERministic Cycle-Level Accelerator on Reconfigurable Platforms in DNN-Enabled Real-Time Safety-Critical Systems

https://doi.org/10.1109/RTSS66672.2025.00039

Ji, Shixin; Yang, Zhuoping; Chen, Xingzhen; Zhang, Wei; Zhuang, Jinming; Jones, Alex K; Dong, Zheng; Zhou, Peipei (December 2025, IEEE)

Deep neural network (DNN) models are increasingly deployed in real-time, safety-critical systems such as autonomous vehicles, driving the need for specialized AI accelerators. However, most existing accelerators support only non-preemptive execution or limited preemptive scheduling at the coarse granularity of DNN layers. This restriction leads to frequent priority inversion due to the scarcity of preemption points, resulting in unpredictable execution behavior and, ultimately, system failure. To address these limitations and improve the real-time performance of AI accelerators, we propose DERCA, a novel accelerator architecture that supports fine-grained, intra-layer flexible preemptive scheduling with cycle-level determinism. DERCA incorporates an on-chip Earliest Deadline First (EDF) scheduler to reduce both scheduling latency and variance, along with a customized dataflow design that enables intra-layer preemption points (PPs) while minimizing the overhead associated with preemption. Leveraging the limited preemptive task model, we perform a comprehensive predictability analysis of DERCA, enabling formal schedulability analysis and optimized placement of preemption points within the constraints of limited preemptive scheduling. We implement DERCA on the AMD ACAP VCK190 reconfigurable platform. Experimental results show that DERCA outperforms state-of-the-art designs using non-preemptive and layer-wise preemptive dataflows, with less than 5% overhead in worst-case execution time (WCET) and only 6% additional resource utilization. DERCA is open-sourced on GitHub: https://github.com/arc-research-lab/DERCA
Free, publicly-accessible full text available December 2, 2026
ART: Customizing Accelerators for DNN-Enabled Real-Time Safety-Critical Systems

https://doi.org/10.1145/3716368.3735215

Ji, Shixin; Chen, Xingzhen; Zhuang, Jinming; Zhang, Wei; Yang, Zhuoping; Schultz, Sarah; Song, Yukai; Hu, Jingtong; Jones, Alex; Dong, Zheng; et al (June 2025, ACM)

Real-time systems are widely applied in areas such as autonomous vehicles, where safety is the key metric. However, on the FPGA platform, most prior accelerator frameworks omit any discussion of schedulability in such real-time safety-critical systems, leaving deadlines unmet, which can lead to catastrophic system failures. To address this, we propose the ART framework, a hardware-software co-design approach that transforms baseline accelerators into “real-time guaranteed” accelerators. On the software side, ART performs schedulability analysis and preemption point placement, optimizing task scheduling to meet deadlines and enhance throughput. On the hardware side, ART integrates the Global Earliest Deadline First (GEDF) scheduling algorithm, implements preemption, and conducts source code transformation to turn baseline HLS-based accelerators into designs targeted for real-time systems capable of saving and resuming tasks. ART also includes integration, debugging, and testing tools for full-system implementation. We demonstrate the methodology of ART on two kinds of popular accelerator models and evaluate it on the AMD Versal VCK190 platform, where ART meets schedulability requirements that baseline accelerators fail to meet. ART is lightweight, utilizing <0.5% resources. With about 100 lines of user input, ART generates about 2.5k lines of accelerator code, making it a push-button solution.
Free, publicly-accessible full text available June 29, 2026
Towards Accelerator Customization in Real-time Safety-critical Systems

https://doi.org/10.1145/3706628.3708841

Ji, Shixin; Chen, Xingzhen; Zhang, Wei; Yang, Zhuoping; Zhuang, Jinming; Schultz, Sarah; Song, Yukai; Hu, Jingtong; Jones, Alex K; Dong, Zheng; et al (February 2025, ACM)

Free, publicly-accessible full text available February 27, 2026
SCARIF: Towards Carbon Modeling of Cloud Servers with Accelerators

https://doi.org/10.1109/ISVLSI61997.2024.00095

Ji, Shixin; Yang, Zhuoping; Chen, Xingzhen; Cahoon, Stephen; Hu, Jingtong; Shi, Yiyu; Jones, Alex K; Zhou, Peipei (July 2024, IEEE)

Embodied carbon has been widely reported as a significant component of the greenhouse gas emissions over the full system lifecycle of various computing systems. Many efforts have been undertaken to quantify the elements that comprise this embodied carbon, from tools that evaluate semiconductor manufacturing to those that can quantify different elements of the computing system from commercial and academic sources. However, these tools cannot easily reproduce results reported by server vendors' product carbon reports, and the accuracy can vary substantially due to various assumptions. Furthermore, attempts to determine greenhouse gas contributions using bottom-up methodologies often do not agree with system-level studies and are hard to reconcile. Nonetheless, given there is a need to consider all contributions to greenhouse gas emissions in datacenters, we propose SCARIF, the Server Carbon including Accelerator Reporter with Intelligence-based Formulation tool. SCARIF makes three main contributions: (1) We first collect reported carbon cost data from server vendors and design statistical models to predict the embodied carbon cost so that users can get the embodied carbon cost for their server configurations. (2) We provide the embodied carbon cost when users configure servers with accelerators, including GPUs and FPGAs. (3) Through case studies, we show that certain datacenter management design choices may be reversed in light of the insights and observations gained from using SCARIF. Thus, SCARIF provides an opportunity for large-scale datacenter and hyperscaler design. We release SCARIF as an open-source tool at https://github.com/arc-research-lab/SCARIF.
Full Text Available
Challenges and Opportunities to Enable Large-Scale Computing via Heterogeneous Chiplets

https://doi.org/10.1109/ASP-DAC58780.2024.10473961

Yang, Zhuoping; Ji, Shixin; Chen, Xingzhen; Zhuang, Jinming; Zhang, Weifeng; Jani, Dharmesh; Zhou, Peipei (March 2024, Asia and South Pacific Design Automation Conference (ASP-DAC))

Fast-evolving artificial intelligence (AI) algorithms such as large language models have been driving the ever-increasing computing demands in today’s data centers. Heterogeneous computing with domain-specific architectures (DSAs) brings many opportunities when scaling up and scaling out the computing system. In particular, a heterogeneous chiplet architecture is favored to keep scaling up and scaling out the system as well as to reduce the design complexity and the cost stemming from the traditional monolithic chip design. However, how to interconnect computing resources and orchestrate heterogeneous chiplets is the key to success. In this paper, we first discuss the diversity and evolving demands of different AI workloads. We discuss how chiplets bring better cost efficiency and shorter time to market. Then we discuss the challenges in establishing chiplet interface standards, packaging, and security issues. We further discuss the software programming challenges in chiplet systems.
Full Text Available
